Abstract: Bandit-based methods for tree search have recently gained popularity when
applied to huge trees, e.g. in the game of Go [GWMT06]. The UCT algorithm [KS06], a
tree search method based on Upper Confidence Bounds (UCB) [ACBF02], is believed to
adapt locally to the effective smoothness of the tree. However, we show that UCT is too
"optimistic" in some cases, leading to a regret $\Omega(\exp(\exp(D)))$, where $D$ is the depth of
the tree. We propose alternative bandit algorithms for tree search. First, a modification
of UCT using a confidence sequence that scales exponentially with the horizon depth is
proven to have a regret $O(2^D \sqrt{n})$, but does not adapt to possible smoothness in the tree.
We then analyze Flat-UCB performed on the leaves and provide a finite regret bound with
high probability. Then, we introduce a UCB-based Bandit Algorithm for Smooth Trees,
which takes into account the actual smoothness of the rewards to perform efficient "cuts" of
sub-optimal branches with high confidence. Finally, we present an incremental tree search
version which applies when the full tree is too big (possibly infinite) to be entirely represented,
and show that, with high probability, essentially only the optimal branch is indefinitely
developed. We illustrate these methods on a global optimization problem of a Lipschitz
function, given noisy data.
Résumé: Tree search methods based on bandit algorithms have recently become very popular
because of their ability to handle large trees, for example in the game of Go [GWMT06]. The
UCT algorithm [KS06], a tree search method based on confidence intervals (the Upper
Confidence Bounds (UCB) algorithm of [ACBF02]), is known to adapt locally to the effective
depth of the tree. However, we show here that UCT can be too "optimistic" in some cases,
leading to a regret $\Omega(\exp(\exp(D)))$, where $D$ is the depth of the tree. We propose
several alternative bandit algorithms for tree search. First, we propose a modification of UCT
using a confidence interval that grows exponentially with the horizon depth of the tree, and
show that it leads to a regret $O(2^D \sqrt{n})$ but does not adapt to the regularity of the tree.
We then analyze a Flat-UCB algorithm, a UCB-type bandit applied directly to the leaves, and
prove a finite bound (independent of $n$) on the regret with high probability. Next, we introduce
a Bandit Algorithm for Smooth Trees, which takes possible regularities of the tree into account
in order to perform efficient "cuts" of sub-optimal branches with high confidence. Finally, we
present an incremental version of tree search that applies when the tree is too large (possibly
infinite) to be represented entirely, and show that, essentially and with high probability, only
the optimal branch is indefinitely developed. We illustrate these methods on a problem of
optimizing a Lipschitz function from noisy data.

Keywords: bandit algorithms, tree search, exploration-exploitation trade-off, upper
confidence bounds, minimax games, reinforcement learning

1 Introduction

Bandit algorithms have recently been used for tree search because of their efficient trade-off
between exploration of the most uncertain branches and exploitation of the most promising
ones, leading to very promising results in dealing with huge trees (e.g. the Go program
MoGo, see [GWMT06]). In this paper we focus on Upper Confidence Bound (UCB) bandit
algorithms [ACBF02] applied to tree search, such as UCT (Upper Confidence Bounds applied
to Trees) [KS06]. The general procedure is described by Algorithm 1 and depends on the
way the upper bounds $B_{i,p,n_i}$ for each node $i$ are maintained.



Algorithm 1 Bandit Algorithm for Tree Search
for $n \ge 1$ do
    Run the $n$-th trajectory from the root to a leaf:
    Set the current node $i_0$ to the root
    for $d = 1$ to $D$ do
        Select node $i_d$ as the child $j$ of node $i_{d-1}$ that maximizes $B_{j, n_{i_{d-1}}, n_j}$
    end for
    Receive reward $x_n \sim X_{i_D}$
    Update the nodes visited by this trajectory:
    for $d = D$ down to $0$ do
        Update the number of visits: $n_{i_d} = n_{i_d} + 1$
        Update the bound $B_{i_d, n_{i_{d-1}}, n_{i_d}}$
    end for
end for



A trajectory is a sequence of nodes from the root to a leaf where, at each node, the
next node is chosen as the child with the largest B value. A reward is
received at the leaf. After a trajectory is run, the B values of each node in the trajectory
are updated. In the case of UCT, the upper bound $B_{i,p,n_i}$ of a node $i$, given that the node
has already been visited $n_i$ times and its parent node $p$ times, is the average of the rewards
$\{x_t\}_{1 \le t \le n_i}$ obtained from that node, $X_{i,n_i} = \frac{1}{n_i}\sum_{t=1}^{n_i} x_t$, plus a confidence interval derived
from a Chernoff-Hoeffding bound (see e.g. [GDL96]):

$$B_{i,p,n_i} \stackrel{\text{def}}{=} X_{i,n_i} + \sqrt{\frac{2\log(p)}{n_i}}. \tag{1}$$
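As a rough illustration of Algorithm 1 with the UCT bound (1), here is a minimal Python sketch. It assumes a complete binary tree stored in heap order (root 1, children 2i and 2i+1), gives unvisited children an infinite index, and breaks ties toward the first child. These conventions, and the names `uct_bound`, `tree_search` and `sample_reward`, are illustrative assumptions and are not part of the algorithm statement above.

```python
import math


def uct_bound(mean, n_parent, n_node):
    """UCT index (1): empirical mean plus sqrt(2 log(p) / n_i).

    Assumption of this sketch: a node that was never visited gets +inf,
    so that every child is tried at least once.
    """
    if n_node == 0:
        return math.inf
    return mean + math.sqrt(2.0 * math.log(n_parent) / n_node)


def tree_search(depth, sample_reward, rounds, bound=uct_bound):
    """Run Algorithm 1 on a complete binary tree in heap order (root = 1).

    sample_reward(j) returns a reward in [0, 1] for leaf j (0-based).
    Returns the per-node visit counts and cumulated rewards.
    """
    size = 2 ** (depth + 1)
    visits = [0] * size
    totals = [0.0] * size
    for _ in range(rounds):
        # Descend from the root, always following the child with largest B.
        path = [1]
        for _ in range(depth):
            parent = path[-1]

            def index(child):
                mean = totals[child] / visits[child] if visits[child] else 0.0
                return bound(mean, visits[parent], visits[child])

            path.append(max((2 * parent, 2 * parent + 1), key=index))
        reward = sample_reward(path[-1] - 2 ** depth)
        # Update every node visited by this trajectory.
        for node in path:
            visits[node] += 1
            totals[node] += reward
    return visits, totals
```

Passing a different `bound` function (with the same three arguments) is one way to experiment with the other confidence sequences discussed in this paper.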



In this paper we consider a max search (the minimax problem is a direct generalization
of the results presented here) in binary trees (i.e. there are 2 actions in each node), although
the extension to more actions is straightforward. Consider a binary tree of depth $D$ where each
leaf $i$ is assigned a random variable $X_i$, with bounded support $[0, 1]$, whose law is unknown.
Successive visits of a leaf $i$ yield a sequence of independent and identically distributed (i.i.d.)
samples $x_{i,t} \sim X_i$, called rewards or payoffs. The value of a leaf $i$ is its expected reward:
$\mu_i \stackrel{\text{def}}{=} \mathbb{E}[X_i]$. Now we define the value of any node $i$ as the maximal value of the leaves in the
branch starting from node $i$. Our goal is to compute the value $\mu^*$ of the root.

An optimal leaf is a leaf having the largest expected reward. We denote by *
quantities related to an optimal node. For example, $\mu^*$ denotes $\max_i \mu_i$. An optimal branch
is a sequence of nodes from the root to a leaf, each having the value $\mu^*$. We define the regret up
to time $n$ as the difference between the optimal expected payoff and the sum of obtained
rewards:

$$R_n \stackrel{\text{def}}{=} n\mu^* - \sum_{t=1}^{n} x_t,$$

where $x_t$ is the reward obtained from the leaf $i_t$ chosen at round $t$. We also define the pseudo-regret up to time $n$:

$$\bar R_n \stackrel{\text{def}}{=} n\mu^* - \sum_{t=1}^{n} \mu_{i_t} = \sum_{j \in \mathcal{L}} n_j \Delta_j,$$

where $\mathcal{L}$ is the set of leaves, $\Delta_j \stackrel{\text{def}}{=} \mu^* - \mu_j$, and $n_j$ is the random variable that counts
the number of times leaf $j$ has been visited up to time $n$. The pseudo-regret may thus be
analyzed by estimating the number of times each sub-optimal leaf is visited.

In tree search, our goal is thus to find an exploration policy of the branches that
minimizes the regret, in order to select an optimal leaf as fast as possible. Now, thanks to a
simple concentration of measure phenomenon, the regret per round $R_n/n$ turns out to be very
close to the pseudo-regret per round $\bar R_n/n$. Indeed, using Azuma's inequality for martingale
difference sequences (see Proposition 1), with probability at least $1 - \beta$, we have at time $n$,

$$\frac{1}{n}\left|R_n - \bar R_n\right| \le \sqrt{\frac{2\log(2/\beta)}{n}}.$$

The fact that $R_n - \bar R_n$ is a sum of martingale differences comes from the property that,
given the filtration $\mathcal{F}_{t-1}$ defined by the random samples up to time $t-1$, the expectation of
the next reward $x_t$ is determined by the leaf $i_t$ chosen by the algorithm: $\mathbb{E}[x_t \mid \mathcal{F}_{t-1}] = \mu_{i_t}$.
Thus $\bar R_n - R_n = \sum_{t=1}^{n} (x_t - \mu_{i_t})$ with $\mathbb{E}[x_t - \mu_{i_t} \mid \mathcal{F}_{t-1}] = 0$. Hence, we will only focus on
providing high probability bounds on the pseudo-regret.
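The two quantities can be computed side by side from a log of played rounds. The following sketch is only meant to make the definitions concrete; the function names and the list-based encoding of leaves are assumptions of this example.

```python
from collections import Counter


def regret(mu_star, rewards):
    """R_n = n * mu_star minus the sum of the rewards actually received."""
    return len(rewards) * mu_star - sum(rewards)


def pseudo_regret(leaf_means, chosen_leaves):
    """Pseudo-regret: sum over leaves j of n_j * Delta_j.

    leaf_means[j] is the expected reward mu_j of leaf j, and
    chosen_leaves is the sequence of leaves i_1, ..., i_n that were played.
    """
    mu_star = max(leaf_means)
    counts = Counter(chosen_leaves)  # n_j for each visited leaf j
    return sum(n_j * (mu_star - leaf_means[j]) for j, n_j in counts.items())
```

The pseudo-regret depends only on how often each leaf was visited, which is why the analysis below reduces to counting visits of sub-optimal leaves.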

First, we analyze the UCT algorithm defined by the upper confidence bound (1). We show
that its behavior is risky and may lead to a regret as bad as $\Omega(\exp(\cdots\exp(D)\cdots))$ ($D - 1$
composed exponential functions). We modify the algorithm by increasing the exploration
sequence, defining:

$$B_{i,p,n_i} \stackrel{\text{def}}{=} X_{i,n_i} + \sqrt{\frac{\sqrt{p}}{n_i}}. \tag{2}$$

This yields an improved worst-case behavior over regular UCT, but the regret may still
be as bad as $\Omega(\exp(\exp(D)))$ (see Section 2). We then propose in Section 3 a modified UCT
based on the bound (3), where the confidence interval is multiplied by a factor that scales
exponentially with the horizon depth. We derive a worst-case regret per round $O(2^D/\sqrt{n})$ with high
probability. However, this algorithm does not adapt to the effective smoothness of the tree,
if any.

Next we analyze the Flat-UCB algorithm, which simply performs UCB directly on the
leaves. With a slight modification of the usual confidence sequence, we show in Section 4
that this algorithm has a finite regret $O(2^D/\Delta)$ (where $\Delta = \min_{i,\,\Delta_i > 0} \Delta_i$) with high
probability.

In Section 5 we introduce a UCB-based algorithm, called Bandit Algorithm for Smooth
Trees, which takes into account the actual smoothness of the rewards to perform efficient
"cuts" of sub-optimal branches based on concentration inequalities. We give a numerical
experiment for the problem of optimizing a Lipschitz function given noisy observations.

Finally, in Section 6 we present and analyze a growing tree search, which incrementally
builds the tree by expanding, at each iteration, the most promising node. This method
is memory efficient and well adapted to search in large (possibly infinite) trees.

Additional notations: Let $\mathcal{L}$ denote the set of leaves and $S$ the set of sub-optimal leaves.
For any node $i$, we write $\mathcal{L}(i)$ the set of leaves in the branch starting from node $i$. For any
node $i$, we write $n_i$ the number of times node $i$ has been visited up to round $n$, and we
define the cumulative rewards:

$$X_{i,n_i} = \frac{1}{n_i} \sum_{j \in \mathcal{L}(i)} n_j X_{j,n_j},$$

the cumulative expected rewards:

$$\bar X_{i,n_i} = \frac{1}{n_i} \sum_{j \in \mathcal{L}(i)} n_j \mu_j,$$

and the pseudo-regret:

$$\bar R_{i,n_i} = \sum_{j \in \mathcal{L}(i)} n_j (\mu_i - \mu_j).$$



2 Lower regret bound for UCT

The UCT algorithm introduced in [KS06] is believed to adapt automatically to the effective
(and a priori unknown) smoothness of the tree: if the tree possesses an effective depth $d < D$
(i.e. if all leaves of a branch starting from a node of depth $d$ have the same value) then its
regret will be equal to the regret of a tree of depth $d$. First, we notice that the bound (1) is
not a true upper confidence bound on the value $\mu_i$ of a node $i$, since the rewards received at
node $i$ are not identically distributed (because the chosen leaves depend on a non-stationary
node selection process). However, due to the increasing confidence term $\log(p)$ when a node
is not chosen, all nodes will be visited infinitely often, which guarantees an asymptotic regret of
$O(\log(n))$. However, the transitory phase may last very long.
Indeed, consider the example illustrated in Figure 1. The rewards are deterministic and,
for a node of depth $d$ in the optimal branch (obtained after choosing $d$ times action 1), if
action 2 is chosen, then a reward of $\frac{D-(d+1)}{D}$ is received (all leaves in this branch have the same
reward). If action 1 is chosen, then this moves to the next node in the optimal branch. At
depth $D - 1$, action 1 yields reward 1 and action 2 yields reward 0. We assume that when a node
is visited for the first time, the algorithm starts by choosing action 2 before choosing action
1.



Figure 1: A bad example for UCT. From the root (left node), action 2 leads to a node from
which all leaves yield reward $\frac{D-1}{D}$. The optimal branch consists in always choosing action
1, which yields reward 1. In the beginning, the algorithm believes that arm 2 is the best,
spending most of its time exploring this branch (as well as all other sub-optimal branches).
It takes $\Omega(\exp(\exp(D)))$ rounds to get the reward 1!

We now establish a lower bound on the number of times suboptimal rewards are received
before getting the optimal reward 1 for the first time. Write $n$ the first instant when the
optimal leaf is reached. Write $n_d$ the number of times the node (also written $d$, with a
slight abuse of notation) of depth $d$ in the optimal branch is reached. Thus $n = n_0$ and
$n_D = 1$. At depth $D - 1$, we have $n_{D-1} = 2$ (since action 2 has been chosen once in node
$D - 1$).

We consider both the logarithmic confidence sequence used in (1) and the square root
sequence in (2). Let us start with the square root confidence sequence (2). At depth $d - 1$,
since the optimal branch is followed by the $n$-th trajectory, we have (writing $d'$ the node
resulting from action 2 in the node $d - 1$):

$$X_{d',n_{d'}} + \sqrt{\frac{\sqrt{n_{d-1}}}{n_{d'}}} \le X_{d,n_d} + \sqrt{\frac{\sqrt{n_{d-1}}}{n_d}}.$$

But $X_{d',n_{d'}} = (D - d)/D$ and $X_{d,n_d} \le (D - (d+1))/D$ since the reward 1 has not been
received before. We deduce that

$$\frac{1}{D} \le \sqrt{\frac{\sqrt{n_{d-1}}}{n_d}}.$$

Thus for the square root confidence sequence, we have $n_{d-1} \ge n_d^2/D^4$. Now, by induction,

$$n \ge \frac{n_1^2}{D^4} \ge \frac{n_2^{2^2}}{D^{4(1+2)}} \ge \frac{n_3^{2^3}}{D^{4(1+2+3)}} \ge \cdots \ge \frac{n_{D-1}^{2^{D-1}}}{D^{2D(D-1)}}.$$

Since $n_{D-1} = 2$, we obtain $n \ge \frac{2^{2^{D-1}}}{D^{2D(D-1)}}$. This is a double exponential dependency
w.r.t. $D$. For example, for $D = 20$, we have $n \ge 10^{156837}$. Consequently, the regret is also
$\Omega(\exp(\exp(D)))$.
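The numerical example for $D = 20$ can be checked by evaluating the stated bound $2^{2^{D-1}}/D^{2D(D-1)}$ in base-10 logarithms, since the number itself is far too large for floating point. The helper name below is an assumption of this sketch.

```python
import math


def log10_uct_lower_bound(depth):
    """Base-10 logarithm of 2^(2^(D-1)) / D^(2D(D-1)), the bound stated in the text."""
    return (2 ** (depth - 1)) * math.log10(2) \
        - 2 * depth * (depth - 1) * math.log10(depth)


# For D = 20 the exponent is about 156837.6, matching n >= 10^156837.
exponent = math.floor(log10_uct_lower_bound(20))
```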

Now, the usual logarithmic confidence sequence defined by (1) yields an even worse
lower bound on the regret, since we may show similarly that $n_{d-1} \ge \exp(n_d/(2D^2))$, thus
$n \ge \exp(\exp(\cdots\exp(2)\cdots))$ (composition of $D - 1$ exponential functions).

Thus, although the UCT algorithm has asymptotic regret $O(\log(n))$ in $n$ (or $O(\sqrt{n})$ for
the square root sequence), the transitory regret is $\Omega(\exp(\exp(\cdots\exp(2)\cdots)))$ (or $\Omega(\exp(\exp(D)))$
for the square root sequence).

The reason for this bad behavior is that the algorithm is too optimistic (it does not
explore enough and may take a very long time to discover good branches that initially looked
bad), since the bounds (1) and (2) are not true upper bounds.



3 Modified UCT 

We modify the confidence sequence to explore the nodes close to the root more than the
leaves, taking into account the fact that the time needed to decrease the bias $(\mu_i - \mathbb{E}[X_{i,n_i}])$
at a node $i$ of depth $d$ increases with the depth horizon $(D - d)$. For such a node $i$ of depth
$d$, we define the upper confidence bound:

$$B_{i,n_i} \stackrel{\text{def}}{=} X_{i,n_i} + (k_d + 1)\sqrt{\frac{2\log(\beta_{n_i}^{-1})}{n_i}} + \frac{k'_d}{n_i}, \tag{3}$$

where $\beta_n \stackrel{\text{def}}{=} \frac{\beta}{2Nn(n+1)}$, with $N = 2^{D+1} - 1$ the number of nodes in the tree, and the
coefficients:

$$k_d \stackrel{\text{def}}{=} \frac{1+\sqrt{2}}{\sqrt{2}}\left[(1+\sqrt{2})^{D-d} - 1\right], \qquad k'_d \stackrel{\text{def}}{=} \frac{3^{D-d} - 1}{2}. \tag{4}$$

Notice that we used a simplified notation, writing $B_{i,n_i}$ instead of $B_{i,p,n_i}$ since the bound
does not depend on the number of visits of the parent node.
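The closed forms of the coefficients $k_d$ and $k'_d$ can be sanity-checked against the recurrences $k_{d-1} = (1+\sqrt{2})(1+k_d)$ and $k'_{d-1} = 3k'_d + 1$ (with $k_D = k'_D = 0$) that appear in the proof of Theorem 1. The sketch below does this numerically; the function names are assumptions of this example.

```python
import math


def k(d, depth):
    """Closed form of k_d: (1 + sqrt 2)/sqrt 2 * ((1 + sqrt 2)^(D - d) - 1)."""
    r = 1.0 + math.sqrt(2.0)
    return r / math.sqrt(2.0) * (r ** (depth - d) - 1.0)


def k_prime(d, depth):
    """Closed form of k'_d: (3^(D - d) - 1) / 2."""
    return (3 ** (depth - d) - 1) / 2


def check_recurrences(depth):
    """Return True if the closed forms satisfy both recurrences at every depth."""
    r = 1.0 + math.sqrt(2.0)
    ok = k(depth, depth) == 0.0 and k_prime(depth, depth) == 0.0
    for d in range(depth, 0, -1):
        ok = ok and math.isclose(k(d - 1, depth), r * (1.0 + k(d, depth)), rel_tol=1e-9)
        ok = ok and k_prime(d - 1, depth) == 3 * k_prime(d, depth) + 1
    return ok
```

Both coefficients grow exponentially with the remaining horizon $D - d$, which is the source of the exponential dependence on $D$ in Theorem 1.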

Theorem 1. Let $\beta > 0$. Consider Algorithm 1 with the upper confidence bound (3). Then,
with probability at least $1 - \beta$, for all $n \ge 1$, the pseudo-regret is bounded by

$$\bar R_n \le \frac{1+\sqrt{2}}{\sqrt{2}}\left[(1+\sqrt{2})^{D} - 1\right]\sqrt{2\log(\beta_n^{-1})\,n} + \frac{3^{D} - 1}{2}.$$

Proof. We first recall Azuma's inequality (see [GDL96]):

Proposition 1. Let $f$ be a Lipschitz function of $n$ independent random variables such that
$f - \mathbb{E}f = \sum_{i=1}^{n} d_i$, where $(d_i)_{1 \le i \le n}$ is a martingale difference sequence, i.e. $\mathbb{E}[d_i \mid \mathcal{F}_{i-1}] = 0$,
$1 \le i \le n$, such that $\|d_i\|_\infty \le 1/n$. Then for every $\epsilon > 0$,

$$P(|f - \mathbb{E}f| > \epsilon) \le 2\exp(-n\epsilon^2/2).$$

We apply Azuma's inequality to the random variables $Y_{i,n_i}$ and $Z_{i,n_i}$, defined respectively,
for all nodes $i$, by $Y_{i,n_i} \stackrel{\text{def}}{=} X_{i,n_i} - \bar X_{i,n_i}$, and for all non-leaf nodes $i$, by

$$Z_{i,n_i} \stackrel{\text{def}}{=} \frac{1}{n_i}\left(n_{i_1} Y_{i_1,n_{i_1}} - n_{i_2} Y_{i_2,n_{i_2}}\right),$$

where $i_1$ and $i_2$ denote the children of $i$.

Since at each round $t \le n_i$, the choice of the next leaf only depends on the random
samples drawn at previous times $s < t$, we have $\mathbb{E}[Y_{i,t} \mid \mathcal{F}_{t-1}] = 0$ and $\mathbb{E}[Z_{i,t} \mid \mathcal{F}_{t-1}] = 0$ (i.e.
$Y$ and $Z$ are martingale difference sequences), and Azuma's inequality gives that, for any
node $i$, for any $n_i$, for any $\epsilon > 0$, $P(|Y_{i,n_i}| > \epsilon) \le 2\exp(-n_i\epsilon^2/2)$ and $P(|Z_{i,n_i}| > \epsilon) \le
2\exp(-n_i\epsilon^2/2)$.

We now define a confidence level $c_{n_i}$ such that with probability at least $1 - \beta$, the random
variables $Y_{i,n_i}$ and $Z_{i,n_i}$ belong to their confidence intervals for all nodes and for all times.
More precisely, let $\mathcal{E}$ be the event under which, for all $n_i \ge 1$, for all nodes $i$, $|Y_{i,n_i}| \le c_{n_i}$
and for all non-leaf nodes $i$, $|Z_{i,n_i}| \le c_{n_i}$. Then, by defining

$$c_n \stackrel{\text{def}}{=} \sqrt{\frac{2\log(\beta_n^{-1})}{n}}, \quad \text{with } \beta_n \stackrel{\text{def}}{=} \frac{\beta}{2Nn(n+1)},$$

the event $\mathcal{E}$ holds with probability at least $1 - \beta$.

Indeed, from a union bound argument, there are at most $2N$ inequalities (for each node,
one for $Y$, one for $Z$) of the form:

$$P(|Y_{i,n_i}| > c_{n_i}, \forall n_i \ge 1) \le \sum_{n_i \ge 1} \frac{\beta}{2Nn_i(n_i+1)} = \frac{\beta}{2N}.$$
We now prove Theorem 1 by bounding the pseudo-regret under the event $\mathcal{E}$. We show
by induction that the pseudo-regret at any node $j$ of depth $d$ satisfies:

$$\bar R_{j,n_j} \le k_d\, n_j c_{n_j} + k'_d. \tag{5}$$

This is obviously true for $d = D$, since the pseudo-regret is zero at the leaves. Now, let $i$ be a
node of depth $d - 1$. Assume that the regret at the children nodes satisfies (5) (for depth
$d$). For simplicity, write 1 the optimal child and 2 the sub-optimal one. Write

$$c^d_n \stackrel{\text{def}}{=} (k_d + 1)c_n + k'_d/n$$

the confidence interval defined by the choice of the bound (3) at a node of depth $d$.

If at round $n$ the node 2 is chosen, this means that $X_{1,n_1} + c^d_{n_1} \le X_{2,n_2} + c^d_{n_2}$. Now,
since $|Z_{i,n_i}| \le c_{n_i}$, we have $n_2(X_{2,n_2} - \bar X_{2,n_2}) \le n_1 Y_{1,n_1} + n_i c_{n_i}$, thus:

$$n_2(X_{1,n_1} + c^d_{n_1}) \le n_1 Y_{1,n_1} + n_2(\bar X_{2,n_2} + c^d_{n_2}) + n_i c_{n_i}.$$

Now, since $|Y_{1,n_1}| \le c_{n_1}$, we deduce that:

$$n_2(\bar X_{1,n_1} - \bar X_{2,n_2} - c_{n_1} + c^d_{n_1}) \le n_1 c_{n_1} + n_2 c^d_{n_2} + n_i c_{n_i}.$$

Now, from the definitions of $\bar X$ and $\bar R$, we have:

$$\bar X_{1,n_1} - \bar X_{2,n_2} = \mu_1 - \frac{\bar R_{1,n_1}}{n_1} - \left(\mu_2 - \frac{\bar R_{2,n_2}}{n_2}\right) = \Delta_i - \frac{\bar R_{1,n_1}}{n_1} + \frac{\bar R_{2,n_2}}{n_2},$$

where $\Delta_i \stackrel{\text{def}}{=} \mu_1 - \mu_2$. Thus, if action 2 is chosen, we have:

$$n_2\left(\Delta_i - \frac{\bar R_{1,n_1}}{n_1} + \frac{\bar R_{2,n_2}}{n_2} - c_{n_1} + c^d_{n_1}\right) \le n_1 c_{n_1} + n_i c_{n_i} + n_2 c^d_{n_2}.$$

From the definition of $c^d_n$ and the assumption (5) on $\bar R_{1,n_1}$, we have $-\bar R_{1,n_1}/n_1 - c_{n_1} + c^d_{n_1} \ge
0$, thus

$$n_2 \le \left[n_1 c_{n_1} + n_i c_{n_i} + n_2(k_d + 1)c_{n_2} + k'_d\right]/\Delta_i \le \left[(1 + \sqrt{2} + k_d)\,n_i c_{n_i} + k'_d\right]/\Delta_i.$$

Thus if $n_2 > [(1 + \sqrt{2} + k_d)\,n_i c_{n_i} + k'_d]/\Delta_i$, the arm 2 will never be chosen any more. We
deduce that for all $n \ge 0$, $n_2 \le [(1 + \sqrt{2} + k_d)\,n_i c_{n_i} + k'_d]/\Delta_i + 1$.
Now, the pseudo-regret at node $i$ satisfies:

$$\bar R_{i,n_i} \le \bar R_{1,n_1} + \bar R_{2,n_2} + n_2\Delta_i \le (1 + \sqrt{2})(1 + k_d)\,n_i c_{n_i} + 3k'_d + \Delta_i,$$
which is of the same form as (5) (using $\Delta_i \le 1$) with

$$k_{d-1} = (1 + \sqrt{2})(1 + k_d), \qquad k'_{d-1} = 3k'_d + 1.$$

Now, by induction, given that $k_D = 0$ and $k'_D = 0$, we deduce the general form (4) of $k_d$
and $k'_d$ and the bound on the pseudo-regret at the root for $d = 0$. □

Notice that the B values defined by (3) are true upper bounds on the node values: under
the event $\mathcal{E}$, for all nodes $i$, for all $n_i \ge 1$, $\mu_i \le B_{i,n_i}$. Thus this procedure is safe, which
prevents bad behaviors for which the regret could be disastrous, as in regular
UCT. However, contrary to regular UCT, in good cases the procedure does not adapt to
the effective smoothness of the tree. For example, at the root level, the confidence sequence
is $O(\exp(D)/\sqrt{n})$, which leads to almost uniform sampling of both actions during a time
$O(\exp(D))$. Thus, if the tree were to contain 2 branches, one only with zeros, one only with
ones, this smoothness would not be taken into account, and the regret would be comparable
to the worst-case regret. Modified UCT is less optimistic than regular UCT but safer in a
worst-case scenario.



4 Flat UCB 

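A method that would combine both the safety of modified UCT and the adaptivity of regular
UCT is to consider a regular UCB algorithm on the leaves. Such a flat UCB could naturally
be implemented in the tree structure by defining the upper confidence bound of a non-leaf
node as the maximal value of the children's bounds:

$$B_{i,n_i} \stackrel{\text{def}}{=} \begin{cases} X_{i,n_i} + \sqrt{\dfrac{2\log(\beta_{n_i}^{-1})}{n_i}} & \text{if } i \text{ is a leaf,} \\ \max\left[B_{i_1,n_{i_1}}, B_{i_2,n_{i_2}}\right] & \text{otherwise,} \end{cases} \tag{6}$$

where we use

$$\beta_n \stackrel{\text{def}}{=} \frac{\beta}{2^D n(n+1)}.$$

We deduce:

Theorem 2. Consider the flat UCB defined by Algorithm 1 and (6). Then, with probability
at least $1 - \beta$, the pseudo-regret is bounded by a constant:

$$\bar R_n \le 40 \sum_{i \in S} \frac{1}{\Delta_i}\log\left(\frac{2^{D+1}}{\Delta_i^2\beta}\right) \le 40\,\frac{2^D}{\Delta}\log\left(\frac{2^{D+1}}{\Delta^2\beta}\right),$$

where $S$ is the set of sub-optimal leaves, i.e. $S = \{i \in \mathcal{L}, \Delta_i > 0\}$, and $\Delta = \min_{i \in S} \Delta_i$.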
Proof. Consider the event $\mathcal{E}$ under which, for all leaves $i$, for all $n \ge 1$, we have $|X_{i,n} - \mu_i| \le
c_n$, with the confidence interval $c_n \stackrel{\text{def}}{=} \sqrt{2\log(\beta_n^{-1})/n}$. Then the event $\mathcal{E}$ holds with probability
at least $1 - \beta$. Indeed, as before, using a union bound argument, there are at most $2^D$
inequalities (one for each leaf) of the form:

$$P(|X_{i,n} - \mu_i| > c_n, \forall n \ge 1) \le \sum_{n \ge 1} \frac{\beta}{2^D n(n+1)} = \frac{\beta}{2^D}.$$

Under the event $\mathcal{E}$, we now provide a regret bound by bounding the number of times each
sub-optimal leaf is visited. Let $i \in S$ be a sub-optimal leaf. Write * an optimal leaf. If at
some round $n$ the leaf $i$ is chosen, this means that $X_{*,n_*} + c_{n_*} \le X_{i,n_i} + c_{n_i}$. Using the (lower
and upper) confidence interval bounds for leaves $i$ and *, we deduce that $\mu^* \le \mu_i + 2c_{n_i}$.
Thus $\left(\frac{\Delta_i}{2}\right)^2 \le \frac{2\log(\beta_{n_i}^{-1})}{n_i}$. Hence, for all $n \ge 1$, $n_i$ is bounded by the smallest integer $m$ such
that $\frac{m}{\log(\beta_m^{-1})} > 8/\Delta_i^2$. Thus $\frac{m-1}{\log(2^D m(m-1)\beta^{-1})} \le w$, writing $w \stackrel{\text{def}}{=} 8/\Delta_i^2$. This implies

$$m \le 1 + w\log(2^D m^2 \beta^{-1}). \tag{7}$$

A first rough bound yields $n_i \le w^2 2^{D-2}\beta^{-1}$, which can be used to derive a tighter upper
bound on $m$. After two recursive uses of (7) we obtain:

$$n_i \le 5w\log(w 2^{D-2}\beta^{-1}).$$

Thus, for all $n \ge 1$, the number of times leaf $i$ is chosen is at most $40\log(2^{D+1}\beta^{-1}/\Delta_i^2)/\Delta_i^2$.
The bound on the regret follows immediately from the property that $\bar R_n = \sum_{i \in S} n_i\Delta_i$. □

This algorithm is safe in the same sense as previously defined, i.e. with high probability,
the bounds defined by (6) are true upper bounds on the values of the leaves. However, since
there are $2^D$ leaves, the regret still depends exponentially on the depth $D$.
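For concreteness, Flat UCB with the bound (6) can be sketched as follows. Since the bound of a non-leaf node is the maximum of its children's bounds, descending the tree amounts to picking the leaf with the largest index, which is what this sketch does directly. The function names, the default `beta`, the infinite index for unvisited leaves and the tie-breaking rule are assumptions of this example.

```python
import math


def leaf_index(mean, n, depth, beta):
    """Leaf bound (6): mean + sqrt(2 log(1/beta_n) / n), beta_n = beta / (2^D n (n+1)).

    Assumption of this sketch: an unvisited leaf gets +inf.
    """
    if n == 0:
        return math.inf
    inv_beta_n = 2 ** depth * n * (n + 1) / beta
    return mean + math.sqrt(2.0 * math.log(inv_beta_n) / n)


def flat_ucb(sample_reward, depth, rounds, beta=0.05):
    """Play UCB directly on the 2^D leaves and return the visit counts.

    sample_reward(j) returns a reward in [0, 1] for leaf j (0-based).
    """
    n_leaves = 2 ** depth
    visits = [0] * n_leaves
    totals = [0.0] * n_leaves

    def index(j):
        mean = totals[j] / visits[j] if visits[j] else 0.0
        return leaf_index(mean, visits[j], depth, beta)

    for _ in range(rounds):
        j = max(range(n_leaves), key=index)  # ties go to the lowest leaf number
        visits[j] += 1
        totals[j] += sample_reward(j)
    return visits
```

With deterministic rewards, each sub-optimal leaf stops being selected once its confidence term drops below its gap, so the number of visits to it stays bounded while the optimal leaf absorbs the remaining rounds.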

Remark 1. Modified UCT has a regret $O(2^D\sqrt{n})$ whereas Flat UCB has a regret $O(2^D/\Delta)$.
The independence w.r.t. $\Delta$ in Modified UCT, obtained at the price of an additional $\sqrt{n}$
factor, comes from the application of Azuma's inequality also to $Z$, i.e. the difference between
the children's deviations $X - \bar X$. A similar analysis in Flat UCB would yield a regret
$O(2^D\sqrt{n})$.

In the next section, we consider another UCB-based algorithm that takes into account
possible smoothness of the rewards to perform effective "cuts" of sub-optimal branches with
high confidence.



5 Bandit Algorithm for Smooth Trees 

We want to exploit the fact that if the leaves of a branch have similar values, then a
confidence interval on that branch may be made much tighter than the maximal confidence
interval of its leaves (as processed in the Flat UCB). Indeed, assume that from a node $i$, all
leaves $j \in \mathcal{L}(i)$ in the branch $i$ have values $\mu_j$ such that $\mu_i - \mu_j \le \delta$. Then,

$$\mu_i \le \frac{1}{n_i}\sum_{j \in \mathcal{L}(i)} n_j(\mu_j + \delta) = \bar X_{i,n_i} + \delta,$$

and thanks to Azuma's inequality, the term $\bar X_{i,n_i} - X_{i,n_i}$ is bounded with probability $1 - \beta$
by a confidence interval $\sqrt{\frac{2\log(\beta^{-1})}{n_i}}$ which depends only on $n_i$ (and not on $n_j$ for $j \in \mathcal{L}(i)$).
We now make the following assumption on the rewards:

Smoothness assumption: Assume that for all depths $d < D$, there exists $\delta_d > 0$ such
that for any node $i$ of depth $d$, for all leaves $j \in \mathcal{L}(i)$ in the branch $i$, we have $\mu_i - \mu_j \le \delta_d$.

Typical choices of the smoothness coefficients $\delta_d$ are exponential $\delta_d \stackrel{\text{def}}{=} \delta\gamma^d$ (with $\delta > 0$
and $\gamma < 1$), polynomial $\delta_d \stackrel{\text{def}}{=} \delta d^{\alpha}$ (with $\alpha < 0$), or linear $\delta_d \stackrel{\text{def}}{=} \delta(D - d)$ (Lipschitz in the
tree distance) sequences.

We define the Bandit Algorithm for Smooth Trees (BAST) by Algorithm 1 with the upper
confidence bounds defined, for any leaf $i$, by $B_{i,n_i} \stackrel{\text{def}}{=} X_{i,n_i} + c_{n_i}$, and for any non-leaf node
$i$ of depth $d$, by

$$B_{i,n_i} \stackrel{\text{def}}{=} \min\left\{\max\left[B_{i_1,n_{i_1}}, B_{i_2,n_{i_2}}\right],\; X_{i,n_i} + \delta_d + c_{n_i}\right\} \tag{8}$$

with the confidence interval

$$c_n \stackrel{\text{def}}{=} \sqrt{\frac{2\log(Nn(n+1)\beta^{-1})}{n}}.$$
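The bound (8) can be evaluated bottom-up over the whole tree. The sketch below assumes heap-ordered nodes (root 1, children 2i and 2i+1, index 0 unused), a callable `delta(d)` returning the smoothness coefficient $\delta_d$, and an infinite bound for nodes that were never visited. These conventions and the name `bast_bounds` are assumptions of this example.

```python
import math


def bast_bounds(visits, totals, depth, delta, beta):
    """Return the list B where B[i] is the BAST bound (8) of heap-ordered node i.

    visits[i] and totals[i] hold n_i and the cumulated reward of node i.
    Assumption of this sketch: unvisited nodes keep a bound of +inf.
    """
    n_nodes = 2 ** (depth + 1) - 1  # N in the text

    def conf(n):
        return math.sqrt(2.0 * math.log(n_nodes * n * (n + 1) / beta) / n)

    bounds = [math.inf] * (n_nodes + 1)
    # Children have larger indices than their parent, so go from last to first.
    for i in range(n_nodes, 0, -1):
        if visits[i] == 0:
            continue
        own = totals[i] / visits[i] + conf(visits[i])
        d = i.bit_length() - 1  # depth of node i in heap order
        if d == depth:
            bounds[i] = own
        else:
            children_max = max(bounds[2 * i], bounds[2 * i + 1])
            bounds[i] = min(children_max, own + delta(d))
    return bounds
```

With `delta` returning infinity at every depth, each non-leaf bound reduces to the maximum of its children's bounds, which is the Flat UCB structure of (6) up to the confidence interval. Smaller coefficients let the node's own statistics tighten the bound.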



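We now provide high confidence bounds on the number of times each sub-optimal node
is visited.

Theorem 3. Let $I$ denote the set of nodes $i$ such that $\Delta_i > \delta_{d_i}$, where $d_i$ is the depth of
node $i$. Define recursively the values $N_i$ associated to each node $i$ of a sub-optimal branch
(i.e. for which $\Delta_i > 0$):

- If $i$ is a leaf, then

$$N_i \stackrel{\text{def}}{=} \frac{40\log(2N\beta^{-1}/\Delta_i^2)}{\Delta_i^2}.$$

- If $i$ is not a leaf, then

$$N_i \stackrel{\text{def}}{=} \begin{cases} N_{i_1} + N_{i_2}, & \text{if } i \notin I, \\ \min\left(N_{i_1} + N_{i_2},\; \dfrac{40\log(2N\beta^{-1}/(\Delta_i - \delta_{d_i})^2)}{(\Delta_i - \delta_{d_i})^2}\right), & \text{if } i \in I, \end{cases}$$

where $i_1$ and $i_2$ are the children nodes of $i$. Then, with probability $1 - \beta$, for all $n \ge 1$, for
all sub-optimal nodes $i$, $n_i \le N_i$.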
Proof. We consider the event $\mathcal{E}$ under which $|\bar X_{i,n} - X_{i,n}| \le c_n$ for all nodes $i$ and all times
$n \ge 1$. The confidence interval $c_n = \sqrt{\frac{2\log(Nn(n+1)\beta^{-1})}{n}}$ is chosen such that $P(\mathcal{E}) \ge 1 - \beta$.
Under $\mathcal{E}$, using the same analysis as in the Flat UCB, we deduce the bound $n_i \le N_i$ for any
sub-optimal leaf $i$.

Now, by backward induction on the depth, assume that $n_i \le N_i$ for all sub-optimal
nodes of depth $d + 1$. Let $i$ be a node of depth $d$. Then $n_i \le n_{i_1} + n_{i_2} \le N_{i_1} + N_{i_2}$.

Now consider a sub-optimal node $i \in I$. If the node $i$ is chosen at round $n$, the form
of the bound (8) implies that for any optimal node *, we have $B_{*,n_*} \le B_{i,n_i}$. Under $\mathcal{E}$,
$\mu^* \le B_{*,n_*}$ and $B_{i,n_i} \le X_{i,n_i} + \delta_d + c_{n_i} \le \mu_i + \delta_d + 2c_{n_i}$. Thus $\mu^* \le \mu_i + \delta_d + 2c_{n_i}$, which
rewrites $\Delta_i - \delta_d \le 2c_{n_i}$. Using the same argument as in the proof of Flat UCB, we deduce
that for all $n \ge 1$, we have $n_i \le \frac{40\log(2N\beta^{-1}/(\Delta_i - \delta_d)^2)}{(\Delta_i - \delta_d)^2}$. Thus $n_i \le N_i$ at depth $d$, which
finishes the inductive proof. □

Now we would like to compare the regret of BAST to that of Flat UCB. First, we expect
a direct gain for nodes $i \in I$. Indeed, from the previous result, whenever a node $i$ of
depth $d$ is such that $\Delta_i > \delta_d$, then this node will be visited, with high probability, at most
$O(1/(\Delta_i - \delta_d)^2)$ times (neglecting log factors). But we also expect an upper bound on $n_i$
whenever $\Delta_i > 0$ if at a certain depth $h \in [d, D]$, all nodes $j$ of depth $h$ in the branch $i$
satisfy $\Delta_j > \delta_h$.

The next result enables us to further analyze the expected improvement over Flat UCB.

Theorem 4. Consider the exponential assumption on the smoothness coefficients: $\delta_d \le
\delta\gamma^d$. For any $\eta \ge 0$ define the set of leaves $I_\eta \stackrel{\text{def}}{=} \{i \in \mathcal{L}, \Delta_i \le \eta\}$. Then, with probability at
least $1 - \beta$, the pseudo-regret satisfies, for all $n \ge 1$,

$$\bar R_n \le \sum_{i \in I_\eta,\, \Delta_i > 0} \frac{40}{\Delta_i}\log\left(\frac{2N}{\Delta_i^2\beta}\right) + |I_\eta|\,\frac{320(2\delta)^c}{\eta^{2+c}}\log\left(\frac{8N}{\eta^2\beta}\right) \tag{9}$$

$$\le 40\,|I_\eta|\left[\frac{1}{\Delta}\log\left(\frac{2N}{\Delta^2\beta}\right) + \frac{8(2\delta)^c}{\eta^{2+c}}\log\left(\frac{8N}{\eta^2\beta}\right)\right],$$

where

$$c \stackrel{\text{def}}{=} \log(2)/\log(1/\gamma).$$

Note that this bound (9) does not depend explicitly on the depth $D$. Thus we expect
this method to scale nicely in big trees (large $D$). The first term in the bound is the same
as in Flat UCB, but the sum is performed only on leaves $i \in I_\eta$ whose value is $\eta$-close to
optimality. Thus, BAST is expected to improve over Flat UCB (at least as expressed by
the bounds) whenever the number $|I_\eta|$ of $\eta$-optimal leaves is small compared to the total
number of leaves $2^D$. In particular, with $\eta < \Delta$, $|I_\eta|$ equals the number of optimal leaves,
so taking $\eta \to \Delta$ we deduce a regret $O(1/\Delta^{2+c})$.

Proof. We consider the same event $\mathcal{E}$ as in the proof of Theorem 3. Call "$\eta$-optimal" a
branch that contains at least one leaf in $I_\eta$. Let $i$ be a node, of depth $d$, that does not belong
to an $\eta$-optimal branch. Let $h$ be the smallest integer such that $\delta_h \le \eta/2$, where $\delta_h \stackrel{\text{def}}{=} \delta\gamma^h$.
We have $h \le \frac{\log(2\delta/\eta)}{\log(1/\gamma)} + 1$. Let $j$ be a node of depth $h$ in the branch $i$. Using similar
arguments as in Theorem 3, the number of times $n_j$ the node $j$ is visited is at most

$$\frac{40\log(2N\beta^{-1}/(\Delta_j - \delta_h)^2)}{(\Delta_j - \delta_h)^2},$$

but since $\Delta_j - \delta_h \ge \Delta_i - \delta_h \ge \Delta_i - \eta/2 \ge \eta/2$, we have $n_j \le L/\eta^2$, writing $L \stackrel{\text{def}}{=}
160\log(8N\beta^{-1}/\eta^2)$.

Now the number of such nodes $j$ is at most $2^{h-d}$, thus:

$$n_i \le 2^{h-d}\,\frac{L}{\eta^2} \le 2\left(\frac{2\delta}{\eta}\right)^{c} 2^{-d}\,\frac{L}{\eta^2} = \frac{2L(2\delta)^c\,2^{-d}}{\eta^{c+2}},$$

with $c = \log(2)/\log(1/\gamma)$. Thus, the number of times $\eta$-optimal branches are not followed
until the $\eta$-optimal leaves is at most

$$|I_\eta| \sum_{d=1}^{D} \frac{2L(2\delta)^c\,2^{-d}}{\eta^{c+2}} \le \frac{2L(2\delta)^c}{\eta^{c+2}}\,|I_\eta|.$$

Now for the leaves $i \in I_\eta$, we derive similarly to the Flat UCB that $n_i \le 40\log(2N\beta^{-1}/\Delta_i^2)/\Delta_i^2$.
Thus, the pseudo-regret is bounded by the sum over all sub-optimal leaves $i \in I_\eta$ of $n_i\Delta_i$
plus the number of trajectories that do not follow $\eta$-optimal branches until $\eta$-optimal leaves,
$|I_\eta|\,\frac{2L(2\delta)^c}{\eta^{c+2}}$. This implies (9). □

Remark 2. Notice that if we choose $\delta = 0$, then the BAST algorithm reduces to regular UCT
(with a slightly different confidence interval), whereas if $\delta = \infty$, then this is simply Flat
UCB. Thus BAST may be seen as a generic UCB-based bandit algorithm for tree search
that makes it possible to take into account the actual smoothness of the tree, if available.



Numerical experiments: global optimization of a noisy function. We search for the
global optimum of a $[0, 1]$-valued function, given noisy data. The domain $[0, 1]$ is uniformly
discretized by $2^D$ points $\{y_j\}$, each one related to a leaf $j$ of a tree of depth $D$. The tree
implements a recursive binary splitting of the domain. At time $t$, if the algorithm selects a
leaf $j$, then the (binary) reward is $x_t \sim \mathcal{B}(f(y_j))$, a Bernoulli random variable with parameter
$f(y_j)$ (i.e. $P(x_t = 1) = f(y_j)$, $P(x_t = 0) = 1 - f(y_j)$).

We assume that $f$ is Lipschitz. Thus the exponential smoothness assumption $\delta_d = \delta 2^{-d}$
on the rewards holds with $\delta$ being the Lipschitz constant of $f$ and $\gamma = 1/2$ (thus $c = 1$). In
the experiments, we used the function

$$f(x) \stackrel{\text{def}}{=} |\sin(4\pi x) + \cos(x)|/2$$

plotted in Figure 2. Note that an immediate upper bound on the Lipschitz constant of $f$ is
$(4\pi + 1)/2 < 7$.
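The experimental setup can be sketched in a few lines. The text only says that the domain is uniformly discretized, so placing $y_j$ at the midpoint of the $j$-th cell is an assumption of this sketch, as are the function names and the seeded random generator.

```python
import math
import random


def f(x):
    """Test function used in the experiments: |sin(4 pi x) + cos(x)| / 2."""
    return abs(math.sin(4.0 * math.pi * x) + math.cos(x)) / 2.0


def leaf_points(depth):
    """Uniform discretization of [0, 1] by 2^D points (cell midpoints, by assumption)."""
    n_leaves = 2 ** depth
    return [(j + 0.5) / n_leaves for j in range(n_leaves)]


def sample_reward(j, points, rng):
    """Bernoulli reward with parameter f(y_j) for leaf j."""
    return 1 if rng.random() < f(points[j]) else 0


def max_slope(points):
    """Largest finite-difference slope between neighbouring points."""
    step = points[1] - points[0]
    return max(abs(f(b) - f(a)) / step for a, b in zip(points, points[1:]))


points = leaf_points(10)
rng = random.Random(0)  # fixed seed so that runs are reproducible
rewards = [sample_reward(j, points, rng) for j in range(len(points))]
```

The finite-difference slopes computed by `max_slope` stay below the bound $(4\pi + 1)/2$ quoted in the text, which is consistent with using $\delta = 7$ as a valid (slightly loose) Lipschitz constant.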
Figure 2: Function $f$ rescaled (solid line) and proportion $n_j/n$ of leaf visits for BAST
with $\delta = 7$, depth $D = 10$, after $n = 10^4$ and $n = 10^6$ rounds.

We compare Flat UCB and the BAST algorithm for different values of $\delta$. Figure 3 shows
the respective pseudo-regret per round $\bar R_n/n$ for BAST used with a good evaluation of the
Lipschitz constant ($\delta = 7$), BAST used with a poor evaluation of the Lipschitz constant
($\delta = 20$), and Flat UCB ($\delta = \infty$). As expected, we observe that BAST outperforms Flat
UCB, and that the performance of BAST is less dependent on the size of the tree than that of
Flat UCB. BAST with a poor evaluation of $\delta$ still performs better than Flat UCB. BAST
concentrates its resources on the good leaves: in Figure 2 we show the proportion $n_j/n$ of
visits of each leaf $j$. We observe that, when $n$ increases, the proportion of visits concentrates
on the leaves with the highest $f$ value.



Remark 3. If we know in advance that the function is smooth (e.g. of class $C^2$ with bounded
second order derivative), then one could use Taylor's expansion to derive much tighter upper
bounds, which would cut sub-optimal branches more efficiently and yield improved
performance. Thus any a priori knowledge about the tree smoothness could be taken into account
in the BAST bound (8).

Figure 3: Pseudo-regret per round $\bar R_n/n$ for $n = 10^6$, as a function of the depth $D \in
\{5, 10, 15, 20\}$, for BAST with $\delta \in \{7, 20\}$ and Flat UCB.

6 Growing trees 

If the tree is too big (possibly infinite) to be represented, one may wish to discover it
iteratively, exploring it at the same time as searching for an optimal value. We propose
an incremental algorithm similar to the method described in [Cou06] and [GWMT06]: the
algorithm starts with only the root node. Then, at each stage $n$ it chooses which leaf, call
it $i$, to expand next. Expanding a leaf means turning it into a node $i$, and adding to our
current tree representation its children leaves $i_1$ and $i_2$, from which a reward (one for each
child) is received. The process is then repeated in the new tree.

We make an assumption on the rewards: from any leaf $j$, the received reward $x$ is a
random variable whose expected value satisfies $\mu_j - \mathbb{E}[x] \le \delta_d$, where $d$ is the depth of $j$,
and $\mu_j$ the value of $j$ (defined as previously by the maximum of its children values).

Such an iterative growing tree requires an amount of memory $O(n)$ directly related to the
number of exploration rounds in the tree.

We now apply the BAST algorithm and expect the tree to grow in an asymmetric way,
expanding in depth the most promising branches first, leaving the sub-optimal branches
mainly unbuilt.

Theorem 5. Consider this incremental tree method using Algorithm 1 defined by the bound
(8) with the confidence interval

$$c_{n_i} \stackrel{\text{def}}{=} \sqrt{\frac{2\log(n(n+1)\,n_i(n_i+1)\,\beta^{-1})}{n_i}},$$
where $n$ is the current number of expanded nodes. Then, with probability $1 - \beta$, for any
sub-optimal node $i$ (of depth $d$), i.e. s.t. $\Delta_i > 0$,

$$n_i \le 10\,\frac{(2+c)^{c+2}}{c^c}\,\frac{\delta^c}{2^d\,\Delta_i^{2+c}}\,\log\left(\frac{n(n+1)(2+c)^2}{2\Delta_i^2\beta}\right). \tag{10}$$

Thus this algorithm essentially develops the optimal branch, i.e. except for $O(\log(n))$
samples at each depth, all computational resources are devoted to further exploring the optimal
branch.

Proof. Consider the event $\mathcal{E}$ under which $|\bar X_{i,n_i} - X_{i,n_i}| \le c_{n_i}$ for all expanded nodes $1 \le
i \le n$, all times $n_i \ge 1$, and all rounds $n \ge 1$. The confidence intervals $c_{n_i}$ are such
that $P(\mathcal{E}) \ge 1 - \beta$. At round $n$, let $i$ be a node of depth $d$. Let $h$ be a depth such
that $\delta_h < \Delta_i$. This is satisfied for all integers $h \ge \frac{\log(\delta/\Delta_i)}{\log(1/\gamma)}$. Similarly to Flat UCB, we
deduce that the number of times $n_j$ a node $j$ of depth $h$ has been visited is bounded by
$40\log(2n(n+1)\beta^{-1}/(\Delta_i - \delta_h)^2)/(\Delta_i - \delta_h)^2$. Thus $i$ has been visited at most

$$n_i \le \min_{h \ge \frac{\log(\delta/\Delta_i)}{\log(1/\gamma)}} 2^{h-d}\,\frac{40\log(2n(n+1)\beta^{-1}/(\Delta_i - \delta_h)^2)}{(\Delta_i - \delta_h)^2}$$

times. This function is minimized (neglecting the log term) for $h = \log\left(\frac{\delta(2+c)}{c\Delta_i}\right)/\log(1/\gamma)$, which
leads to (10). □



For illustration, Figure 4 shows the tree obtained when this method is applied to the function
optimization problem of the previous section, after $n = 300$ rounds. The most deeply explored
branches are those with the highest value.



7 Conclusion 

We analyzed several UCB-based bandit algorithms for tree search. The good
exploration-exploitation trade-off of these methods makes it possible to rapidly return a good
value, and to improve precision if more time is provided. BAST makes it possible to take into
account possible smoothness in the tree to perform efficient "cuts"¹ of sub-optimal branches,
with high probability.

If additional smoothness information is provided, the $\delta$ term in the bound (8) may be
refined, leading to improved performance. Empirical information, such as variance estimates,
could improve knowledge about local smoothness, which may be very helpful to refine the
bounds. However, it seems important to use true upper confidence bounds, in order to avoid
bad cases such as the one illustrated for regular UCT. Applications include minimax search for
games in large trees, and global optimization under uncertainty.

¹ Note that this term may be misleading here, since the UCB-based methods described here never
explicitly delete branches.